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METHOD FOR IDENTIFYING AUTHORIZED USERS USING A SPECTROGRAM AND 

APPARATUS OF THE SAMS 



Field of the Invention 

The present invention relates to speech identification, 
especially to a method for identifying an authorized user using 
a spectrogram and the apparatus of the same. 

Description of the Related Art 

With the development of communications technology, mobile 
phones have become extremely popular. However, there exist 
numerous problems regarding the security of mobile phones. For 
example, an unauthorized user may use a mobile phone without 
any permission, thus causing a loss of the phone's owner. 

To prevent a mobile phone from being used by an 
unauthorized user, a Personal Identification Number (PIN) is 
normally provided. The user is required to enter a password 
when the mobile phone is turned on. The user can use the mobile 
phone only when the submitted password is correct. However, 
this way is troublesome since it requires the user to remember 
their password. Once the user forgets the password or enters 
an incorrect password, the mobile phone locks and the user is 
unable to use it. Furthermore, since an unauthorized user may 
still get the password, the PIN system fail to fully meet the 
requirements of mobile phone security. 

In order to overcome the shortcomings of the above - 
described method, some prior arts use speech identification 
technology to identify authorized users. For example, in U.S. 
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Patent No. 5,913,196, at least two voice authentication 
algorithms are used to analyze the voice of a speaker. 
Furthermore, in U.S. Patent No. 5,499,288, heuristically- 
developed time domain features and spectrum information such 
5 as FFT (Fast Fourier Transform) coefficients are retrieved from 
the voice of a speaker. Then, the second and third features 
are determined based on the primary feature. These features 
are applied to the speech identification process. In U.S. 
Patent No. 5,216,720, LPC (Linear Predictive Coding) analysis 
10 is used to obtain the speech features. Moreover, DTW (Dynamic 
Time Warping) is used to score the distance of submitted speech 

O 

features and reference speech features, 
jfjf To practice the above prior arts requires a complex and 

4* unwieldy hardware structure. The prior arts are not, 

IpJi 15 therefore, feasibly applied to mobile phone technology. 

q SUMMARY OF THE INVENTION 

*f* Accordingly, in order to overcome the drawbacks of the 

-11 prior arts, an object of the present invention is to provide 

r7 20 a method for identifying an authorized user and an apparatus 

s — 

of the same, which identifies a user based on the specific 
spectrogram of various users to determine whether the user is 
authorized . 

Since an inherent difference exists between individuals 
25 in the way they speak and vocal physiology, such as the 
structure of the vocal region, the size of the nasal cavity and 
the feature of the vocal cords, the speech of each person 
contains numerous unique characteristics. The invention 
extracts this unique information from speech by using 
30 spectrogram analysis to identify users. 

According to the present invention, the user is first 
asked to vocalize a spoken password. An endpoint detection 
algorithm is applied to detect the beginning and end points of 
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the speech. As the speech is analyzed, the modified discrete 
Cosine transformation (MDCT) is used to transform the time 
domain information into frequency domain to create the 
spectrogram for the received speech. A fixed dimension feature 
5 vector is computed from the spectrogram. If this vocalization 
is regarded as a training template, it will be stored in a memory 
device such as a RAM and accessed as a reference template. 
Otherwise, the item is regarded as a testing template to 
identify whether the speaker is authorized. A pattern matching 
10 procedure is introduced to compare the testing template and the 
reference template. Similarity can be measured by a distance 
computation. Based on the resulting distance, an acceptance 
or rejection command from the apparatus of this invention can 
be generated. 




BRIEF DESCRIPTION OF THE DRAWINGS 

The present invention can be more fully understood by 
reading the subsequent detailed description in conjunction 
with the examples and references made to the accompanying 
20 drawings, wherein: 

Fig. 1 is a diagram illustrating the method for 
identifying the authorized user of a telephone according to 
this invention; 

Fig. 2 is a block diagram of the apparatus for analyzing 
25 speech according to this invention; 

Fig. 3 is a diagram illustrating the pre-emphasis process 
according to this invention; 

Fig. 4 is a diagram illustrating the process for 
determining a short -time majority magnitude according to this 
30 invention; 

Fig. 5 is a diagram illustrating the process for detecting 
the end point according to this invention; 
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Fig. 6 is a diagram illustrating the method for extracting 
the speech features from the spectrogram according to this 
invention; 

Fig. 7 is a block diagram of the apparatus for identifying 
5 the authorized user using a spectrogram according to this 
invention . 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT 

This embodiment is described with regard to the user of 
10 a mobile phone. Refer to Fig. 1, the method for identifying 
„ the authorized user of a telephone according to this invention 

y;! includes the steps of: (i) step 100, detecting the end point 

pi 

~l of a speech after the user speaks; (ii) step 110, extracOing 

j= the speech features from the spectrogram of the speech; (iii) 

l y 

g| 15 step 120, determining whether training is needed, if yes, then 
= £ going to step 122, taking the speech features as a reference 

O template and going to step 124 to set a threshold, and then going 

Ll back to the step 100, otherwise going to the next step; (iv) 

step 13 0, matching the patterns of the speech features and the 
M 20 reference template; (v) step 140, computing the distance 
between the speech features and the reference template based 
on the compared result of the step 130 to obtain the distance 
scoring; (vi) step 150, comparing distance scoring with the 
threshold; (vii) step 160, determining whether the user is 
25 authorized according the compared result of the step 150. 
Next, each step is further explained. 

Referring to Fig. 2, the process for detecting the end 
point of a speech includes the steps of: (i) step 200, filtering 
the analog speech signal with a low-pass filter; (ii) step 210, 
30 the signal output from the low-pass filter is digitized by an 
A/D converter and then the digitized signal is sampled at a rate 
of 8 kHz with 8-bit resolution; (iii) step 220, passing the 
samples through a pre-emphasizer to thoroughly model both the 
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lower-amplitude and the higher- frequency parts of the speech; 
(iv) step -230, extracting the majority magnitude to describe 
the characteristic of amplitude; (v) step 240, comparing the 
majority magnitude of each frame and a pre -determined threshold 
5 to determine the beginning and end points of the speech . 

In the step 200, the frequency limitation of the low-pass 
filter is 3500 Hz. 

In this embodiment, since the pre -emphasizing factor a is 
selected to be 31/32, the pre -emphasizing process can be 
10 achieved by the following equation: 
G y(n) = x (n) -ax (n-1) = x (n) - (31/32 ) x (n-1) = x(n)-x(n- 

|J l)+x(n-l)/32 

Ml The process of pre -emphasizing the digital data in step 

fy 220 is as shown in Fig. 3. Wherein x(n) and y(n) are digitized 

15 data, reference numeral 300 indicates a subtraction operation 

"•3 

s and reference numeral 310 indicates an addition operation. 

rf i The pre -emphasized speech data is divided into frames. 

Each frame contains 160 samples (i.e., 20 millisecond) . The 

O parameter called majority magnitude obtained in the step 23 0, 

"20 is extracted to describe the characteristic of the amplitude. 

Referring to Fig. 4, the process of obtaining the majority 
magnitude includes the steps of: (i) step 400, clearing the 

array ary [0] , , ary[127] ; (ii) step 410, determining 

whether the digitized data y(n) belongs to the current frame, 
25 if yes going to the next step, otherwise going to step 430; (iii) 
step 420, updating the array ary[|y(n) |] of y(n), in which 
ary[|y(n) |] = ary[|y(n) |]+1; (iv) step 422, going on the next 
digitized data so that n = n+1, and then going back to the step 
410; (v) step 43 0, obtaining the index value k of the maximums 
30 of the array ary[0] , . . . , ary [127] for each digitized data; (vi) 
step 440, defining the majority magnitude of the i-th frame, 
mmag(i) = k; (vii) step 450, determining whether there is a next 
frame to be processed, if yes going to the next step, otherwise 
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going to the end; (viii) step 452 , performing the calculation 
of next frame and letting i = i + 1, then going back to the step 
400 . 

In the process of extracting the majority magnitude, the 
5 total number of each absolute amplitude level is counted. The 
great majority of absolute amplitude levels is defined as the 
majority magnitude of current frame. The majority magnitude 
is used in this invention to replace the traditional energy for 
saving the computation power. 
10 Referring to Fig. 5, the process for determining the 

beginning and end points of a speech in the step 24 0 includes 
the steps of: (i) step 500, initially setting the threshold to 
20; (ii) step 510, determining whether there is a begin point 

2" being detected, if yes going to step 540, otherwise going to 

I ii J 

01 15 the next step; (iii) step 52 0, determining whether the majority 
magnitudes mmg(i-2), mmg(i-l) and mmg(i) of three adjacent 
~f frames are all larger than the threshold, if yes going to step 

M 53 0, otherwise going to the next step; (iv) step 522, updating 

5j the threshold; (v) step 524, letting i = i + 1 and then going back 

H= 20 to the step 510; (vi) step 530, a begin point being detected; 

(vii) step 532, determining that the begin point is located at 
the i-2-th frame; (viii) step 534, letting k = 0 and then going 
to the step 524; (ix) step 540, letting k = k+1; (x) step 550, 
determining whether k is larger than 10, if yes going to the 
25 next step, otherwise going to the step 540; (xi) step 560, 
determining whether the ma j ority magnitudes mmg ( i -2 ) , mmg(i-l) 
and mmg(i) of three adjacent frames are all smaller than the 
threshold, if yes going to step 570, otherwise going to the next 
step; (xii) step 562, letting i = i + 1 and then going back to 
30 the step 560; (xiii) step 570, an end point being detected; 

(xiv) step 580, determining that the end point is located at 
the i-2-th frame and going to the end. 




6 



Client's ref . : 1^^2/2001-6-14 

File : 0492 -44 05USF/ Ray/Kevin 




In the above process of detecting the end point, the 
threshold of the background noise is initially set to 20. The 
majority magnitude is extracted for each frame. The majority 
magnitude is then compared with the preset threshold to 
5 determine whether the frame is a part of the speech. It 
indicates that the begin point of the speech is detected if the 
majority magnitudes of three adjacent frames are all larger 
than the threshold. Otherwise, the frame is regarded as a new 
event of background noise and the threshold is updated. The 
10 update procedure is carried out by the following equations. 
Q new_threshold = (old__thresholdx 3 l+new_input ) -r32 

W = (old_thresholdx32- 

Bl 

£ old_threshold+new_input ) -j-3 2 

jjl = old_threshold+ (new_input-old_threshold) h-32 

15 The division can be implemented by a shifting operation. 

CI Moreover, based on the assumption that there is at least a 

L=i 300-millisecond duration for a single sample, the detection of 

Jjf the end point starts 10 frames after the beginning frame. It 

M: indicates that the end point of the speech is detected if the 

20 majority magnitudes of three adjacent frames are all smaller 
than the threshold. 

In order to retrieve the speech features from the 
spectrogram, a Princen-Bradley filter bank is used to transform 
the detected speech signal to get its corresponding spectrogram 
25 in this embodiment. The Princen-Bradley filter bank is 
disclosed in "Analysis/Synthesis Filter Bank Design Based On 
Time Domain Aliasing Cancellation, " IEEE Trans, on Acoustics, 
Speech, and Signal Processing, Vol. ASSP-34, No. 5, Oct. 1986, 
pp. 1153-1161, by John P. Princen and Alan Bernard Bradley. 
30 Referring to Fig. 6, the process of retrieving the speech 

features from the spectrogram includes the steps of: (i) step 
600, assuming the frame length K = 256 and the frame rate M = 
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128; (ii) step 610, dividing the detected voice signal into T 
PCM (Pulse Code Modulation) samples x(n) , where n = 0, . . . , T-l; 
(iii) step 620, using the Princen-Bradley filter bank X(k,m) 
to calculate the spectrogram, where k = 0, . . . . , K/2 and m = 
5 0, . . . . , T/M; (iv) step 630, uniformly segmenting the T/M 
vectors into Q segments, and averaging vectors belonging to the 
q-th segment to form a new vector Z (q) = Z (0 , q) , . . . , Z (K/2 , q) ; 
(v) step 64 0, tracking the local peak and determining that 
Z(k,q) is the local peak if Z (k, q) >Z (k+1 , q) and Z(k,q)>Z(k- 
10 l,q) / then setting W(k,q) = 1 for local peak, otherwise setting 
Q W(k,q) = 0 for others, where k = 0, . . . , K/2 and q = 0, . . . , Q-l 

and W is the last feature vector, and going to the end. 
Wf In the above-described process of retrieving speech 

ffj features from the spectrogram, the Princen-Bradley filter bank 

^ 15 is applied to transform the detected speech signal into its 
B _ corresponding spectrogram. Assume that a frame has K PCM 

m samples, and the current frame has M PCM samples overlapped with 

*ll the next frame. In this embodiment, K and M are respectively 

O set to 256 and 12 8. Thus, the k-th band signal of the m-th frame 

'""'20 can be calculated by the following equation. 

Y(k,m) = 2 y(n)h(mM-n+K-l)cos(m7i/2-27i(n+nO)/K) 

Coefficients of the window h( ) can be found in the Table 
XI of the above-described Princen and Bradley paper. Y (m) = 
Y(0,m) , . . . Y(K/2,m) cover the frequency ranges over 0 Hz to 4000 
25 Hz . If the detected speech has a total of T PCM samples, L (=T/M) 
vectors of Y (m) is calculated to represent the spectrogram of 
these T PCM samples. The L vectors Y (m) are then uniformly 
segmented into Q segments. Vectors belonging to the q-th 
segment are averaged to form a new vector Z (q) = Z(0,q) , . . . , 
30 Z(K/2,q). Thereafter, A local peak tracking subroutine is 
performed to mark the local peaks by setting W(k,q) = 1 for local 
peak and setting W(k,q) = 0 for others. A pattern having 
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Q(K/2 + l) bits is thus obtained to represent the spectrogram of 
the detected speech. 

Next, pattern matching and distance computation are 
initiated. The distance scoring dis between the reference 
5 template RW which is made of RW (0) , . . . , RW (Q) and the testing 
template TW which is made of TW(0) , . . . , TW(Q) is calculated by 
the following equation. 

dis = Z |TW(i, j) -RW(i, j) | , where i = 0, . . . , K/2 and j = 0, . . . ,Q. 
Since the value of TW(i,j) and RW(i,j) is either 1 or 0 , 
10 implementation of the above equation can be easily worked out 
by a bit operation. The threshold value in Fig. 1 is pre- 
m determined by an authorized user. If the value dis obtained 

*yf from the above equation does not exceed the threshold, an 

PJ acceptance command is sent. 

T| 15 Referring to Fig. 7, the apparatus to identify authorized 

users by using spectrogram includes a low-pass filter 10, an 
01 A/D converter 20, a digital signal processor 3 0 and a memory 

8=8 device 40. 

The low-pass filter 10 is used to limit the frequency range 
20 of the submitted speech. 

The A/D converter 2 0 is used to convert the analog signal 
of submitted speech to a digital signal . 

The digital signal processor 30 is used to receive the 
digital signal from the A/D converter 2 0 and implements the 
25 operations in each step of Fig. 1. 

The memory device 4 0 is used to store the data of the 
threshold and the reference template, which are required in the 
operations of the digital signal processor 30. 

Finally, while the invention has been described by way of 
30 example and in terms of the preferred embodiment, it is to be 
understood that the invention is not limited to the disclosed 
embodiments. On the contrary, it is intended to cover various 
modifications and similar arrangements as would be apparent to 
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those skilled in the art. Therefore, the scope of the appended 
claims should be accorded the broadest interpretation so as to 
encompass all such modifications and similar arrangements. 
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